3D Object Reconstruction from Hand-Object Interactions
Recent advances have enabled 3D object reconstruction approaches using a
single off-the-shelf RGB-D camera. Although these approaches are successful
for a wide range of object classes, they rely on stable and distinctive
geometric or texture features. Many objects like mechanical parts, toys, and
household or decorative articles, however, are textureless and characterized
by minimalistic shapes that are simple and symmetric. Existing in-hand
scanning systems and 3D reconstruction techniques fail for such symmetric
objects in the absence of highly distinctive features. In this work, we show
that extracting 3D hand motion for in-hand scanning effectively facilitates
the reconstruction of even featureless and highly symmetric objects, and we
present an approach that fuses the rich additional information of hands into
a 3D reconstruction pipeline, significantly advancing the state of the art of
in-hand scanning.
Comment: International Conference on Computer Vision (ICCV) 2015,
http://files.is.tue.mpg.de/dtzionas/In-Hand-Scannin
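A minimal sketch can illustrate the key idea: tracked 3D hand joints provide
a rigid-motion estimate for the grasped object even when the object itself
offers no features. The snippet below is a hypothetical illustration, not the
paper's pipeline; it recovers the frame-to-frame rigid transform from
corresponding hand-joint positions via the Kabsch algorithm, which could then
seed point-cloud registration for fusion.

    import numpy as np

    def rigid_transform_from_joints(joints_a, joints_b):
        """Estimate the rigid transform (R, t) mapping joints_a onto joints_b.

        joints_a, joints_b: (N, 3) arrays of corresponding 3D hand-joint
        positions in two frames (Kabsch algorithm via SVD).
        """
        ca, cb = joints_a.mean(axis=0), joints_b.mean(axis=0)
        H = (joints_a - ca).T @ (joints_b - cb)      # 3x3 cross-covariance
        U, _, Vt = np.linalg.svd(H)
        d = np.sign(np.linalg.det(Vt.T @ U.T))       # guard against reflections
        R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
        t = cb - R @ ca
        return R, t

A rigidly grasped object moves with the hand, so (R, t) also transforms the
object's points and can initialize registration where geometry alone is
ambiguous.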
Capturing Hand-Object Interaction and Reconstruction of Manipulated Objects
Hand motion capture with an RGB-D sensor has recently gained a lot of research attention; however, even the most recent approaches focus on the case of a single isolated hand. We focus instead on hands that interact with other hands or with a rigid or articulated object. Our framework successfully captures motion in such scenarios by combining a generative model with discriminatively trained salient points, collision detection and physics simulation to achieve a low tracking error with physically plausible poses. All components are unified in a single objective function that can be optimized with standard optimization techniques. We initially assume a priori knowledge of the object's shape and skeleton. In case of unknown object shape, there are existing 3D reconstruction methods that capitalize on distinctive geometric or texture features. These methods, though, fail for textureless and highly symmetric objects like household articles, mechanical parts or toys. We show that extracting 3D hand motion for in-hand scanning effectively facilitates the reconstruction of such objects, and we fuse the rich additional information of hands into a 3D reconstruction pipeline. Finally, although shape reconstruction is enough for rigid objects, there is a lack of tools that build rigged models of articulated objects that deform realistically using RGB-D data. We propose a method that creates a fully rigged model consisting of a watertight mesh, embedded skeleton and skinning weights by employing a combination of deformable mesh tracking, motion segmentation based on spectral clustering and skeletonization based on mean curvature flow.
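The unified objective described here, a generative data term combined with
salient-point, collision, and physics terms, can be sketched as a weighted
sum of differentiable energies over the pose parameters. The terms, weights,
and parameter count below are placeholders, not the thesis's actual
formulation; the point is only that a single scalar objective admits standard
optimizers.

    import numpy as np
    from scipy.optimize import minimize

    # Stand-in energies over pose parameters theta. In the actual framework
    # these would be model-to-depth alignment, discriminatively detected
    # salient-point correspondences, a penetration penalty, and a
    # physical-plausibility term.
    def e_data(theta):      return float(np.sum(theta ** 2))
    def e_salient(theta):   return float(np.sum((theta - 0.1) ** 2))
    def e_collision(theta): return float(np.sum(np.maximum(theta - 1.0, 0.0) ** 2))
    def e_physics(theta):   return float(np.sum(np.maximum(-theta, 0.0) ** 2))

    def objective(theta, w=(1.0, 0.5, 10.0, 1.0)):
        return (w[0] * e_data(theta) + w[1] * e_salient(theta)
                + w[2] * e_collision(theta) + w[3] * e_physics(theta))

    theta0 = np.zeros(26)  # e.g. global pose plus joint angles of one hand
    result = minimize(objective, theta0, method="L-BFGS-B")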
Capturing Hands in Action using Discriminative Salient Points and Physics Simulation
Hand motion capture is a popular research field, recently gaining more
attention due to the ubiquity of RGB-D sensors. However, even the most recent
approaches focus on the case of a single isolated hand. In this work, we focus
on hands that interact with other hands or objects and present a framework that
successfully captures motion in such interaction scenarios for both rigid and
articulated objects. Our framework combines a generative model with
discriminatively trained salient points to achieve a low tracking error and
with collision detection and physics simulation to achieve physically plausible
estimates even in case of occlusions and missing visual data. Since all
components are unified in a single objective function which is almost
everywhere differentiable, it can be optimized with standard optimization
techniques. Our approach works for monocular RGB-D sequences as well as setups
with multiple synchronized RGB cameras. For a qualitative and quantitative
evaluation, we captured 29 sequences with a large variety of interactions and
up to 150 degrees of freedom.
Comment: Accepted for publication by the International Journal of Computer
Vision (IJCV) on 16.02.2016 (submitted on 17.10.14). A combination into a
single framework of an ECCV'12 multicamera-RGB and a monocular-RGBD GCPR'14
hand tracking paper with several extensions, additional experiments and
details
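The collision component mentioned above is often built from sphere proxies
attached to the fingers and the object, with a smooth penalty on penetration
depth so it fits into the differentiable objective. The sketch below assumes
such sphere proxies and a quadratic penalty; it is one common construction,
not necessarily the paper's exact term.

    import numpy as np

    def collision_energy(centers_a, radii_a, centers_b, radii_b):
        """Quadratic penalty on penetration between two sets of spheres.

        centers_*: (N, 3) / (M, 3) sphere centers; radii_*: (N,) / (M,).
        Smooth almost everywhere, matching the requirement that the full
        objective be optimizable with standard techniques.
        """
        diff = centers_a[:, None, :] - centers_b[None, :, :]   # (N, M, 3)
        dist = np.linalg.norm(diff, axis=-1)                   # (N, M)
        pen = np.maximum(radii_a[:, None] + radii_b[None, :] - dist, 0.0)
        return float(np.sum(pen ** 2))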
POCO: 3D Pose and Shape Estimation with Confidence
The regression of 3D Human Pose and Shape (HPS) from an image is becoming
increasingly accurate. This makes the results useful for downstream tasks like
human action recognition or 3D graphics. Yet, no regressor is perfect, and
accuracy can be affected by ambiguous image evidence or by poses and appearance
that are unseen during training. Most current HPS regressors, however, do not
report the confidence of their outputs, meaning that downstream tasks cannot
differentiate accurate estimates from inaccurate ones. To address this, we
develop POCO, a novel framework for training HPS regressors to estimate not
only a 3D human body, but also its confidence, in a single feed-forward pass.
Specifically, POCO estimates both the 3D body pose and a per-sample variance.
The key idea is to introduce a Dual Conditioning Strategy (DCS) for regressing
uncertainty that is highly correlated to pose reconstruction quality. The POCO
framework can be applied to any HPS regressor and here we evaluate it by
modifying HMR, PARE, and CLIFF. In all cases, training the network to reason
about uncertainty helps it learn to more accurately estimate 3D pose. While
this was not our goal, the improvement is modest but consistent. Our main
motivation is to provide uncertainty estimates for downstream tasks; we
demonstrate this in two ways: (1) We use the confidence estimates to bootstrap
HPS training. Given unlabelled image data, we take the confident estimates of a
POCO-trained regressor as pseudo ground truth. Retraining with this
automatically-curated data improves accuracy. (2) We exploit uncertainty in
video pose estimation by automatically identifying uncertain frames (e.g. due
to occlusion) and inpainting these from confident frames. Code and models will
be available for research at https://poco.is.tue.mpg.de.
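A standard way to obtain a per-sample variance of the kind POCO regresses is
a Gaussian negative-log-likelihood head (aleatoric uncertainty). The sketch
below shows that generic construction only; it does not implement POCO's
Dual Conditioning Strategy, and the shapes are assumptions.

    import torch

    def gaussian_nll(pred_pose, log_var, gt_pose):
        """NLL of a diagonal Gaussian over pose parameters.

        pred_pose, gt_pose: (B, D) pose vectors; log_var: (B, D) predicted
        log-variance. Large residuals can be absorbed by large variance,
        while the 0.5 * log_var term penalizes blanket uncertainty.
        """
        sq_err = (pred_pose - gt_pose) ** 2
        return (0.5 * torch.exp(-log_var) * sq_err + 0.5 * log_var).mean()

A regressor trained with such a head returns (pose, log_var); thresholding
exp(log_var) is then enough to keep only confident estimates, e.g. as pseudo
ground truth for retraining, as the abstract describes.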
Expressive Body Capture: 3D Hands, Face, and Body from a Single Image
To facilitate the analysis of human actions, interactions and emotions, we
compute a 3D model of human body pose, hand pose, and facial expression from a
single monocular image. To achieve this, we use thousands of 3D scans to train
a new, unified, 3D model of the human body, SMPL-X, that extends SMPL with
fully articulated hands and an expressive face. Learning to regress the
parameters of SMPL-X directly from images is challenging without paired images
and 3D ground truth. Consequently, we follow the approach of SMPLify, which
estimates 2D features and then optimizes model parameters to fit the features.
We improve on SMPLify in several significant ways: (1) we detect 2D features
corresponding to the face, hands, and feet and fit the full SMPL-X model to
these; (2) we train a new neural network pose prior using a large MoCap
dataset; (3) we define a new interpenetration penalty that is both fast and
accurate; (4) we automatically detect gender and the appropriate body models
(male, female, or neutral); (5) our PyTorch implementation achieves a speedup
of more than 8x over Chumpy. We use the new method, SMPLify-X, to fit SMPL-X to
both controlled images and images in the wild. We evaluate 3D accuracy on a new
curated dataset comprising 100 images with pseudo ground-truth. This is a step
towards automatic expressive human capture from monocular RGB data. The models,
code, and data are available for research purposes at
https://smpl-x.is.tue.mpg.de.
Comment: To appear in CVPR 2019
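The fit-to-features step follows the analysis-by-synthesis recipe sketched
below: project the model's 3D joints, compare against detected 2D keypoints,
and descend on the model parameters under a prior. The projection function,
prior, and weights here are placeholders, not the actual SMPLify-X energy.

    import torch

    def fit_to_keypoints(model, params, keypoints_2d, conf, camera, steps=100):
        """Optimize body-model parameters to match 2D keypoint detections.

        model(params) -> (J, 3) joints; camera(joints_3d) -> (J, 2)
        projections; keypoints_2d: (J, 2) detections with per-joint
        confidences conf: (J,). The quadratic prior is a stand-in for the
        learned pose prior the paper trains on MoCap data.
        """
        params = params.clone().requires_grad_(True)
        opt = torch.optim.Adam([params], lr=1e-2)
        for _ in range(steps):
            opt.zero_grad()
            joints_2d = camera(model(params))
            reproj = (conf[:, None] * (joints_2d - keypoints_2d) ** 2).sum()
            prior = (params ** 2).sum()
            loss = reproj + 1e-3 * prior
            loss.backward()
            opt.step()
        return params.detach()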
Populating 3D Scenes by Learning Human-Scene Interaction
Humans live within a 3D space and constantly interact with it to perform
tasks. Such interactions involve physical contact between surfaces that is
semantically meaningful. Our goal is to learn how humans interact with scenes
and leverage this to enable virtual characters to do the same. To that end, we
introduce a novel Human-Scene Interaction (HSI) model that encodes proximal
relationships, called POSA for "Pose with prOximitieS and contActs". The
representation of interaction is body-centric, which enables it to generalize
to new scenes. Specifically, POSA augments the SMPL-X parametric human body
model such that, for every mesh vertex, it encodes (a) the contact probability
with the scene surface and (b) the corresponding semantic scene label. We learn
POSA with a VAE conditioned on the SMPL-X vertices, and train on the PROX
dataset, which contains SMPL-X meshes of people interacting with 3D scenes, and
the corresponding scene semantics from the PROX-E dataset. We demonstrate the
value of POSA with two applications. First, we automatically place 3D scans of
people in scenes. We use a SMPL-X model fit to the scan as a proxy and then
find its most likely placement in 3D. POSA provides an effective representation
to search for "affordances" in the scene that match the likely contact
relationships for that pose. We perform a perceptual study that shows
significant improvement over the state of the art on this task. Second, we show
that POSA's learned representation of body-scene interaction supports monocular
human pose estimation that is consistent with a 3D scene, improving on the
state of the art. Our model and code are available for research purposes at
https://posa.is.tue.mpg.de.
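Given POSA's per-vertex encoding, a candidate placement can be scored by how
well the pose's expected contacts agree with nearby scene geometry and
semantics. The sketch below assumes that encoding and a brute-force
nearest-neighbor lookup; it illustrates the scoring idea, not the paper's
VAE or its actual search procedure.

    import numpy as np

    def placement_score(vertices, contact_prob, sem_label, scene_pts, scene_sem):
        """Score a posed body against a scene under a POSA-like encoding.

        vertices: (V, 3) posed body vertices; contact_prob: (V,) in [0, 1];
        sem_label: (V,) expected scene semantics per vertex; scene_pts:
        (S, 3) scene surface points with labels scene_sem: (S,).
        """
        d = np.linalg.norm(vertices[:, None, :] - scene_pts[None, :, :], axis=-1)
        nn = d.argmin(axis=1)
        nn_dist = d[np.arange(len(vertices)), nn]
        # Vertices expecting contact should be near the scene and on the
        # right kind of surface (e.g. buttocks on 'sofa', feet on 'floor').
        sem_match = (scene_sem[nn] == sem_label).astype(float)
        return float(np.sum(contact_prob * sem_match * np.exp(-nn_dist)))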
3D Human Pose Estimation via Intuitive Physics
Estimating 3D humans from images often produces implausible bodies that lean,
float, or penetrate the floor. Such methods ignore the fact that bodies are
typically supported by the scene. A physics engine can be used to enforce
physical plausibility, but these are not differentiable, rely on unrealistic
proxy bodies, and are difficult to integrate into existing optimization and
learning frameworks. In contrast, we exploit novel intuitive-physics (IP) terms
that can be inferred from a 3D SMPL body interacting with the scene. Inspired
by biomechanics, we infer the pressure heatmap on the body, the Center of
Pressure (CoP) from the heatmap, and the SMPL body's Center of Mass (CoM). With
these, we develop IPMAN to estimate a 3D body from a color image in a "stable"
configuration by encouraging plausible floor contact and overlapping CoP and
CoM. Our IP terms are intuitive, easy to implement, fast to compute,
differentiable, and can be integrated into existing optimization and regression
methods. We evaluate IPMAN on standard datasets and MoYo, a new dataset with
synchronized multi-view images, ground-truth 3D bodies with complex poses,
body-floor contact, CoM and pressure. IPMAN produces more plausible results
than the state of the art, improving accuracy for static poses, while not
hurting dynamic ones. Code and data are available for research at
https://ipman.is.tue.mpg.de.
Comment: Accepted in CVPR'23. Project page: https://ipman.is.tue.mpg.de
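The stability idea can be sketched directly: derive a soft pressure weight
from each vertex's height above the floor, take the pressure-weighted mean of
contact points as the CoP, and penalize the horizontal CoP-CoM offset. The
toy term below assumes uniform per-vertex mass in place of the paper's
body-part mass model, and is an illustration of such intuitive-physics terms
rather than IPMAN's implementation.

    import numpy as np

    def stability_energy(vertices, floor_z=0.0, tau=0.01):
        """Toy intuitive-physics term: squared horizontal CoP-CoM offset.

        vertices: (V, 3) body vertices with z up. Vertices near the floor
        receive a soft 'pressure' weight; their weighted mean is the CoP.
        """
        height = vertices[:, 2] - floor_z
        pressure = np.exp(-np.maximum(height, 0.0) / tau)  # soft floor contact
        pressure = pressure / (pressure.sum() + 1e-8)
        cop_xy = (pressure[:, None] * vertices[:, :2]).sum(axis=0)
        com_xy = vertices[:, :2].mean(axis=0)              # uniform-mass CoM
        return float(np.sum((cop_xy - com_xy) ** 2))

Being differentiable in the vertices, such a term drops into the same
optimization and regression frameworks the abstract targets.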